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Abstract — Achieving high efficiency with numerical kernels for 
sparse matrices is of utmost importance, since they are part of 
many simulation codes and tend to use most of the available 
compute time and resources. In addition, especially in large 
scale simulation frameworks the readability and ease of use 
of mathematical expressions are essential components for the 
continuous maintenance, modification, and extension of software. 

In this context, the sparse matrix-matrix multiplication is of 
special interest. In this paper we thoroughly analyze the single- 
core performance of sparse matrix-matrix multiplication kernels 
in the Blaze Smart Expression Template (SET) framework. We 
develop simple models for estimating the achievable maximum 
performance, and use them to assess the efficiency of our 
implementations. Additionally, we compare these kernels with 
several commonly used SET-based C++ libraries, which, just as 
Blaze, aim at combining the requirements of high performance 
with an elegant user interface. 

For the different sparse matrix structures considered here, we 
show that our implementations are competitive or faster than 
those of the other SET libraries for most problem sizes on a 
current Intel multicore processor. 

I. Motivation 

Various popular simulation algorithms in high performance 
computing (HPC), such as computational dynamics for rigid 
bodies, rely on sparse matrix-matrix multiplication (spMMM) 
as one of their computational kernels. Due to its central role 
in the applications and its computational complexity it is of 
vital importance to have highly optimized implementations. 
However, apart from performance, other metrics such as 
programmability, readability and foremost maintainability are 
crucial for a successful long-term software development effort. 
Yet these metrics usually play only a minor role in HPC 
software development. Although there exist several approaches 
to provide fast spMMM implementations, these libraries, like 
most HPC software efforts, strictly focus on high performance 
but do not so well in most other software metrics (see also JT]). 
This neglect especially endangers complex, long-term software 
developments due to impeded maintenance work. However, 
the maintainability of software should be of major interest: on 
average, 60% of the total software development costs is spent 
in maintenance, where long-term projects usually lean towards 
higher maintenance costs 0. Improving the maintainability 
immediately leads to less effort in software modification and 
extension and subsequently to fewer software defects. 

This realization is the driving force behind several C++ 



Smart Expression Template (SET) math libraries. These li- 
braries attempt to combine highly optimized math kernels for 
vector and matrix operations with the advantages of a domain- 
specific language. They include an intuitive formulation of 
mathematical operations, high readability, and easy modifica- 
tion of operations (see for instance Listing [T}. 

Listing 1: spMMM formulation in the Blaze SET math library. 

1 blaze :: CompressedMatr ix<double, rowMajor> A, B, C; 
ill . . . Initialization of matrices A and B 

3C = A * B; 



In this paper we focus on the optimization of the sequential 
spMMM algorithm in the Blaze SET library, and compare the 
resulting performance characteristics for two chosen sparse 
matrices with similar high-performance SET-based frame- 
works. It will be clear that such a comparison can only make 
sense when the analysis is performed over a wide range of 
problem sizes, which rules out an extensive survey of popular 
matrix collections. We recognize that such a survey would be 
desirable. In this work we prefer a deeper analysis with more 
insight, however, and leave the survey to future research. 

This paper is organized as follows. In Section [II] we give 
a short overview of other C++ math libraries that follow an 
approach similar to Blaze, before Section [HI] briefly summa- 
rizes the details of our benchmark platform and benchmarking 
strategy. In Section |IV] we describe basic tasks and necessary 
steps for spMMM, together with appropriate performance 
models. Here we also demonstrate the general suitability of the 
SET methodology for HPC in terms of performance and the 
advantages in terms of software development. The optimized 
kernels are benchmarked and compared to several other SET- 
based C++ math libraries in Section [V] Section [VI] concludes 
the paper and provides suggestions for future work. 

II. Related Work 

The C++ programming language provides the feature to 
directly overload mathematical operators, which enables a very 
intuitive application of mathematical operations. However, due 
to the necessary creation of a temporary in each operation, 
the performance of classic C++ operator overloading can- 
not compete with other approaches. A reputed solution are 
Expression Templates (ET), which due to lazy evaluation 



of the result promise optimized performance. The first ET- 
based C++ library for dense arithmetic was Blitz++ 0. This 
framework, written by the inventor of ETs, Todd Veldhuizen, 
has been recognized as a pioneer in the area of C++ template 
metaprogramming [4j. The Boost uBLAS library is one of 
the most widespread ET math libraries since it is distributed 
together with the Boost library 0. In contrast to Blitz++ it 
additionally provides sparse matrices and vectors. However, 
in [TJ we have demonstrated that the assumption that ETs 
are a performance optimization is not justified, and have 
introduced an improved solution by the Smart Expression 
Template (SET) methodology. Among other features, SETs 
encapsulate performance-optimized compute kernels like those 
provided by the BLAS and LAPACK standards. An early 
example for a SET library is Armadillo Q, which is restricted 
to dense linear algebra operations, but employs SETs to 
integrate BLAS and LAPACK for optimized performance. 
The same feature is provided by MTL4, which additionally 
includes sparse matrix operations. An alternative with similar 
functionality is the GMM++ library J8], which allows to use 
ATLAS Q as BLAS backend. Numerics involving dense and 
sparse matrices and vectors, the use of optimized kernels, and 
the advanced SET features of intrinsics-based vectorization of 
non-BLAS operations and automatic expression optimization 
are supported by the Eigen3 JTO] and the Blaze flTlJ libraries. 
In contrast to Eigen3, which provides optimized kernels for 
all basic operations, Blaze uses BLAS subroutines for BLAS 
level 2 and 3 operations. 

In UJ we have analyzed several of these ET implemen- 
tations in detail and have introduced the notion of SETs 
and our SET library Blaze in particular. In [12] we have 
extended our analysis to more ET-based libraries, focused 
on the optimization and vectorization capabilities for dense 
arithmetic, and presented performance results for the CG 
algorithm, which is fundamental for many applications. In 
this work we expand our analysis to sparse arithmetic and the 
sparse matrix-matrix multiplication (spMMM) in particular. 

Much work has been devoted to sparse matrix based al- 
gorithms and efficient implementations in the past. However, 
most publications deal with parallel sparse matrix-vector mul- 
tiplication ifPTl . lfl4"l . |[T5l . since it is of pivotal importance 
in solving sparse linear systems and sparse eigenvalue prob- 
lems. While there has also been substantial work on sparse 
matrix-matrix multiplication, it mostly deals with execution 
and communication efficiency in the parallel case 1(161 . ifTTl . 
Here, however, we only cover the sequential kernel any try to 
understand its features in a well-defined setting, and especially 
in the context of SET frameworks. Consequently, issues of 
load and communication balancing, which would be crucial 
in the parallel case, do not arise. 

III. Benchmark Platform and Test-Cases 

An Intel Sandy Bridge i7-2600 CPU was used for all bench- 
marks. Using only one of the four cores it runs at 3.8 GHz 
with 8 MB of shared L3 cache, 256 kB of L2 cache and 32 kB 
of LI data cache. The maximum achievable main memory 



bandwidth (as measured by the STREAM benchmark |[T8l ) 
is about 18.5 GB/s. For each non-zero element of a sparse 
matrix we store the value as double precision floating point 
number and an index as a 64-bit integral value. Since we 
concentrate on general sparse matrices and there is no vector 
gather instruction in current x86 designs, we do not utilize 
SIMD vectorization but run scalar code. This means that the 
CPU is capable of performing one double precision floating 
point multiplication and one double precision floating point 
addition as well as either two load or one load (LD) and one 
store (ST) instruction per cycle |[T9l . Therefore the theoretical 
peak performance is 7.6 GFlop/s. 

The benchmark platform runs an openSuse 12.1 and 
we use the GNU g++ compiler with the following com- 
piler flags: -Wall -Werror -ansi -pedantic -03 
-mavx -DNDEBUG 

We use the Blazemark benchmark suite, which ships with 
Blaze, for a direct comparison of the different libraries. It uses 
the same random seed for all libraries and care is taken that 
randomly generated numbers and structures are identical for 
all tested libraries. We extended the Blazemark to have the 
option to compare not only different libraries but also multiple 
implementations of the Blaze spMMM kernels. To make sure 
that all measured times are accurate the Blazemark runs short 
test-cases several times until the total runtime exceeds two 
seconds. Furthermore, each test is performed at least 5 times 
and the best result is taken as the final measurement. The 
number of floating point operations per second (Flops/s) for 
the spMMM are calculated as follows: The number of required 
multiplications is 

n-1 

aft * bjc, 

k=0 

where at is the number of non-zeros in the fc-th column of 
A, and bk is the number of non-zeros in the fc-th row of B. 
The number of additions required is always bounded by the 
required number of multiplications. We always use the worst 
case assumption to calculate the Flops, which means that the 
overall number of floating point operations is approximately 
twice the number of multiplications. 

Two different input matrices over a range of problem sizes 
are used to review the outcome of our analysis. The first test- 
case multiplies two five-band matrices, which are created by 
using a 5-point stencil resulting from a finite difference dis- 
cretization of a Dirichlet boundary value problem on a square. 
All graphs showing the result of multiplying two of these 
five-band matrices are marked with (FD). The second test- 
case uses two randomly generated quadratic matrices. For each 
matrix five random numbers are placed on random locations 
in each row. This allows for a good comparability between 
the two test-cases in terms of the fill rate of the matrices. 
Whenever a graph shows the outcome of the multiplication of 
two randomly generated matrices it is marked with (random). 



IV. Implementation and performance analysis of 

THE SPMMM KERNEL 

For a thorough analysis it has turned out to be convenient 
to split the spMMM kernel in two logically independent parts: 
The pure computation and the actual storing of the results. 

A. The pure spMMM computation kernel 

Looking only at the pure computation of the spMMM 
allows to implement this part of the kernel and be sure that 
it works at the highest possible performance without any 
interference of additional data accesses for storing the result. 
In Blaze we use implementations of the two well known 
formats "compressed sparse row" (CSR) and "compressed 
sparse column" (CSC) l20l . These formats usually show good 
performance for general matrices on general-purpose cache - 
based microprocessors. Both formats use an array of pointers, 
which provides an immediate access to a specific row (CSR) 
or column (CSC). 

The classic way of calculating a matrix-matrix product 
C = A * B is to perform a dot product-like operation between 
a row of A and a column of B for each element in the resulting 
matrix. To achieve optimal performance with this approach, 
the format should be CSR for matrix A and CSC for B, while 
the format of C is irrelevant. The problem is that both vectors 
are sparse and therefore the operation suffers from all known 
issues of sparse vector-vector multiplications. Furthermore the 
results of these "dot products" are zero most of the time. 
Optimizations such as unrolling or blocking, which would 
lead to increased computational intensity l2D . |[l4l . rely on 
exploiting specific matrix structures and will not be explored 
here. 

Another algorithm, optimized for a set of three CSR or three 
CSC matrices, was introduced in ll22l . As shown in Figure Q~|it 
multiplies each non-zero value a r;C of row r of matrix A with 
all non-zero entries b C)X of matrix B. The intermediate results 
are collected in a dense temporary vector, which is initially 
filled with zeros, by just adding each result to the current 
value at the position x of the temporary vector. If this is done 
for all non-zero values of row r of matrix A, the vector is 
a dense representation of the rth row of the resulting matrix. 
Note that the approach can also be applied to column-major 
matrices in the spMMM with three CSC matrices. 



temp [o~»o o i • j 



to* 



X 



• • • • 

























• 




< 


» 

• 


• 



• • • 
• • • 



providing a fallback to the "classic" algorithm. The effort to 
convert the format is linear in the number of non-zero entries. 
Therefore it is not necessary to implement a total of eight 
computation kernels for all eight possible combinations. It is 
sufficient to convert one of the two matrices to be able to use 
the row-major or column-major algorithm. 

In order to arrive at a realistic upper performance limit for 
our computational kernels we employ a simple bandwidth- 
based performance model ll23l . lETI : The maximum perfor- 
mance for a loop is 



P = max P, 



B r 



where 6 max is the bandwidth of the relevant data path in 
bytes/s and B c is the loop's code balance: 

Data traffic [Bytes] 

B c = 

Floating point operations [Flops] 

This model works well if the performance of the loop is 
dominated by the data transfers to and from a single data 
path. Other effects, such as dependencies, abundant branch 
mispredictions, or the general inability of a single core to 
saturate the bandwidth of some memory hierarchy levels, may 
cause significant deviations from the model l24l . However, it 
is still a valuable starting point for a loop-based performance 
analysis, since it provides a "light speed" estimate. We con- 
centrate here on modeling the more advanced implementations 
of the spMMM kernel, since the naive version is plagued by 
conditional branches and erratic access patterns, which are not 
easily modeled. 

Listing 2: The row-major computation kernel without storing the 
result to the matrix C. 

i void compute ( 

2 CSRMatrixS C, 

3 const CSRMatrixS A, 

4 const CSRMatrixS B ) 

5 { 

6 typeclef CSRMatrix : : const_iterator iterator; 
7 

8 // Estimate the number of elements in matrix C 

9 nnzEstimation ( C, A, B ) ; 

// Temporary vector to store the result row-wise 
std : : vector<double> tempi C. columns!), 0.0 ); 

// Loop over all rows of the target matrix 
for( std::size_t cy = 0; cy < Crows () ; ++cy ) 

{ 

iterator ait ( A. begin (cy) ); 
iterator const aend( A.end(cy) ); 

// Loop over the non-zero entries of the 
/ / current row of A 
f or ( ; ait!=aend; ++ait ) 



ABC 

Figure 1: Sketch of a spMMM with the row-major algorithm. 

In case one of the two matrices is available in CSR format 
and the other in CSC format it turns out to be more efficient 
to convert one of the matrices to the other format instead of 



{ 



std::size_t const indexA ( ait->index() ) ; 
double const valueA( ait->value() ) ; 

iterator bit ( B . begin ( indexA) ) ; 
iterator const bend ( B . end (indexA) ); 

// Loop over the non-zero entries of the 
// current row of B 



32 f or ( ; bit!=bend; ++bit ) 

33 { 

34 size_t const indexB ( bit->index() ); // LD 

35 

36 // Update value 

37 // LD + Mult + LD + ADD + ST 

38 temp[indexB] += valueA * bit->value ( ) ; 

39 } 

40 } 
41 

42 // Write result to matrix C 

43 // Reset all entries of vector temp to 0.0 



44 } 
45} 



Listing [2] shows the code for the row-major computation 
kernel. The inner loop between lines [32] and [39] has a code 
balance of 16 Bytes/Flop. We assume that the update to the 
temp [ ] vector causes a load and a store to the relevant 
memory hierarchy level, but ignore non-consecutive accesses, 
which would lead to excess data traffic. Hence, the predictions 
of the balance model must be seen as best-case values. Within 
the LI cache this leads to a maximum theoretical performance 
of 3800 MFlops/sec at 3.8 GHz clock frequency, whereas in 
memory the limit is 1 140 MFlops/sec. 

Figure [2] shows performance results versus problem size 
(number of matrix rows) for the 5-point finite difference sten- 
cil matrices. The row-major algorithm (CSR x CSR) clearly 
achieves the best results for CSR x CSR and even comes 
close to the theoretical performance of 1140 MFlops/sec 
beyond the L3 cache limit. Even if the right-hand side operand 
is given as a CSC matrix and is therefore internally converted 
to CSR (CSR x CSC (with conversion)), still about 50% of 
the original performance is achieved. The classic CSR x CSC 
kernel cannot compete with the the row-major approach due 
to the problems mentioned before. The fact that the row- 
major algorithm's performance only drops slightly for matrices 
that do not fit into the L3 cache anymore shows that the 
balance model is problematic for in-cache situations, and more 
advanced modeling techniques would be required |f24l|. All 
data of the left-hand side matrix is traversed with stride one. 
For the right-hand side operand the prefetcher can easily 
predict which data to load, thanks to the fixed five-band pattern 
of the matrix. 

Figure [3] shows the results for the test case which uses 
randomly generated spares matrices. The classic CSR x CSC 
algorithm is not influenced by the structure of the matrices 
and therefore shows the same bad performance we saw in 
Figure [2] The row-major approach clearly achieves better 
results. However, because of the random structure of the left- 
hand side operand the prefetcher does not work optimally for 
the right-hand side matrix; thus, performance goes down with 
growing problem sizes. The classic approach does not show 
any significant performance for problem sizes greater than 
N = 200. Compared to this the row-major approach shows 
a much better performance even for huge matrices that do not 
fit into the L3 cache anymore, and also if the right-hand side 
matrix hast to be converted to the other format. Due to the 
cache-unfriendly access patterns the calculated performance 
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Figure 2: Performance results of the pure computation (FD). 
The maximum theoretical performance beyond the L3 limit is 
1140 MFlops/sec. 



limits cannot nearly be reached for this matrix. 
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Figure 3: Performance results of the pure computation (random). 
The maximum theoretical performance beyond the L3 limit is 
1140 MFlops/sec. 

Note that the general guideline to have a regular matrix 
structure for best performance is valid predominantly in view 
of the left-hand side matrix A; the performance is largely 
independent of the structure of B. 

B. Storing the spMMM result 

The algorithm in Listing [2] only calculates all the entries 
for the result matrix, but never actually stores them to the 
matrix object. Therefore all further optimization is driven by 
the requirement to access the memory when storing the result 
in the most efficient way. 

In this context, estimating the number of non-zero entries 
in the resulting matrix is an essential aspect. It is of highest 
importance to prevent frequent dynamic memory allocations 
during the calculation. Therefore an estimation of the final 
number of non-zero entries is required that never underesti- 
mates and, if possible, only slightly overestimates the needed 
memory. We found that the number of multiplications required 
to perform the spMMM (see Hill) is a good estimate. Each 
intermediate result either takes a place which is still zero or 
is added to another intermediate result. Due to this fact the 
number is always equal or higher than the number of non- 



zeros in the resulting matrix. Using this estimation the memory 
allocation is only done once at the beginning of the kernel. 

Another performance-critical part is the interface for storing 
the values in the resulting matrix. Our implementation of the 
CSR/CSC formats provides two low-level functions for this. 
First the append function, which appends an entry. It is the 
programmer's responsibility to append values in increasing 
row order and, within each row, in increasing column order. 
The second function is finalize, which marks the end of 
a row after all values have been appended. It has to be called 
after each row and leaves the matrix in a consistent state (note 
that the CSC format is handled accordingly). Streaming the 
results in this way has the advantage that all the values are 
stored in one successive memory block, and the underlying 
data structure for the row access is only modified once per 
spMMM. 

We have shown above that the row-major algorithm (see 
Listing O is very efficient. It calculates a dense representation 
of each result row, which subsequently has to be stored in the 
sparse result matrix. However, the way the temporary vector 
is converted to a sparse row is crucial. A first alternative 
is a brute force approach, which iterates over the double 
values of the temporary vector and appends all non-zero values 
to the resulting matrix ("Brute Force"-double). To reduce 
the amount of memory that has to be traversed the second 
approach is to use an additional lookup vector, either of type 
bool ("Brute Force"-bool) or char ("Brute Force"-char). 
In the STL a std : : vector<bool> is implemented as a bit 
field ll25l and can therefore hold information for 512 positions 
per cache line instead of 8 doubles or 64 chars. Figure [4] 
shows the performance results for the CSR x CSR "brute 
force" kernels for the 5-point finite difference stencils and 
Figure [3] shows the corresponding results for the randomly 
created matrices. 




N 

Figure 4: Comparison of different "Brute Force" and "MinMax" 
kernels (FD) for the complete spMMM. 

Despite the fact that "Brute Force"-bool accesses the least 
memory it has to perform additional Boolean operations for 
each entry, which leads to the worst performance in both 
cases. Also in both cases the additional char vector increases 
the performance slightly compared with the "Brute Force"- 
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Figure 5: Comparison of different "Brute Force" and "MinMax" 
kernels (random) for the complete spMMM. 



double approach without die additional lookup vector. Also 
shown are our "MinMax" kernels, which basically do the same 
as the "Brute Force" kernels, but additionally keep track of 
the lowest and highest index of the non-zero entries in the 
temporary vector. Especially in the test-case with the five-band 
matrices this optimization gives a considerable performance 
boost. Notably, using the additional char vector hurts the 
performance of "MinMax" considerably. With the "MinMax" 
kernel each checked entry of temporary vector is more likely 
a non-zero value and therefore the advantage of the lookup 
vector is not big enough to compensate the extra effort. 

Even though the "MinMax" approach is better than "Brute 
Force," both influence the performance significantly. In addi- 
tion, the bigger the problem sizes the more the performance 
suffers compared to the pure computation kernel. With the 
problem size also the length of the temporary vector and 
the number of elements in the minimum-maximum range 
increases, but the absolute number of non-zeros does not 
change significantly. 

The next approach is to store all indices for non-zero 
elements within a row in a separate vector, which is usually 
small enough to fit into any cache level. After the complete row 
is calculated the few entries of the vector that hold the indices 
are sorted using std: :sort, and then only these positions 
of the temporary vector are appended to the resulting matrix. 
Figure [6] shows the performance results for the CSR x CSR 
for the five-point finite difference stencils with the sorting 
kernel (Sort) and Figure |7]illustrates the corresponding results 
for the test-case with the randomly generated matrices. It 
shows that the performance drawback of the sorting approach 
does not significantly increase with the problem size. 

For both test-cases the "MinMax" approach still performs 
better at small problem sizes. Hence, the final approach is 
to combine the "MinMax" and "Sort" kernels to the new 
"Combined" kernel. The decision which of the two storing 
strategies to use is performed for every single row. Note that 
it is more important that the decision can be done quickly than 
that it is precise, as it is performed for every row separately. 
The current implementation uses "MinMax" if its region is 
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Figure 6: Comparison of the "MinMax" approach with the "Sort" 
approach (FD) for the complete spMMM. 
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Figure 7: Comparison of the "MinMax" approach with the "Sort" 
approach (random) for the complete spMMM. 



smaller than twice the number of non-zero values in this row 
and "Sort" in all other cases. In Figure [7] the switch from 
"MinMax" to "Sort" is clearly visible between TV = 49 and 
N = 64. We found that as long as the storing method is not 
about to change, the "Combined" kernel is at most 5% less 
efficient than the kernels with only a single strategy. Overall, 
the "Combined" kernel reaches 35% of the pure computation 
performance for the CSR x CSR test case using the five-point 
finite difference stencils. 

All previously shown test-cases used matrices with a fixed 
number of non-zero entries in each row. This means that the fill 
ratio decreases with increasing problem size. The benchmark 
shown in Figure [8] uses the same matrix generation algorithm 
as for the random case, but the fill ratio is 0.1% for each row 
instead of the fixed five elements. With the increasing absolute 
number of non-zero entries in each row the fill ratio of the re- 
sult matrix increases. At N w 38000 the "MinMax" approach 
exceeds the performance of the "Combined" kernel, which 
uses the "Sort" storing strategy. At this point the fill ratio of the 
result matrix is 3.7% or about 1400 non-zero entries per row. 
For the "MinMax" kernel this means that on average every 
third cache line loaded actually contains one non-zero entry. 
Our conclusion is that there is a break-even point in terms of 



problem size for which the "MinMax" approach is faster than 
sorting because of the growing probability that loaded data is 
actually stored in the result matrix. 
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Figure 8: Comparison of the "MinMax" approach with the "Sort" 
approach, multiplying randomly generated matrices with a fixed 0.1% 
fill ratio. 

V. Performance Comparison of SET Libraries 

In this section we compare the performance of the Blaze 
library to other expression template based C++ libraries. We 
selected the most common libraries that provide the according 
kernels for the multiplication of two CSR matrices and the 
multiplication of a CSR and a CSC matrix. We use the Boost 
uBLAS library in version 1.51, MTL4 in version 4.0.8883 (open 
source edition), Eigen3 in version 3.1.1, and Blaze in version 
1.1, the latter employing the fastest "Combined" kernel from 
Section IIV-BI All libraries were benchmarked as given. We 
only present double precision results in MFlop/s graphs for 
each test case. For all in-cache benchmarks we make sure that 
the data has already been loaded to the cache. 

Figure [9] shows the comparison of the results of the 
CSR x CSR kernels for sparse matrices resulting from five- 
point finite difference stencils. The Blaze library achieves 
roughly twice the performance of Eigen3 and MTL4. uBLAS 
cannot compete with the others, since it abstracts from the 
actual storage order of the operands and traverses the right- 
hand side operand in a column-wise fashion despite it being 
stored in row-major order. It becomes apparent that with a 
proper implementation of the kernel the size of the matrix 
hardly influences the performance. Only a small drop can be 
observed for matrices that do not fit into the L3 cache anymore 
and have to be loaded from main memory. 

Figure [10] summarizes the results for the CSR x CSR 
kernels for randomly created sparse matrices. Again, Blaze 
shows a higher performance than the Eigen3 and MTL4 
libraries, and uBLAS falls far behind. In comparison to sparse 
matrices resulting from finite difference stencils, though, the 
performance clearly depends on the size of the matrix and 
degrades with growing matrix sizes. 

The results for the CSR x CSC kernels for sparse matrices 
resulting from five -point finite difference stencils are presented 
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Performance comparison for the CSR = CSR x CSR Figure 11: Performance comparison for the CSR = CSR x CSC 
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Figure 10: Performance comparison for the CSR = CSR x CSR 
benchmark (random). 



Figure 12: Performance comparison for the CSR = CSR x CSC 
benchmark (random). 



in Figure [TT] The performance of the Blaze and MTL4 libraries 
drop due to the creation of a temporary CSR matrix and 
converting the storage order of the right-hand side operand. 
The performance of Eigen3 slightly increases in comparison 
to the CSR x CSR kernel. Also the performance of the uBLAS 
library increases since the strategy of multiplying a row and 
a column fits the given storage orders. However, still the 
performance drops quickly with growing problem size and 
prohibits the multiplication of large sparse matrices. 

Finally, Figure[T2]shows the results for CSR x CSC kernels 
for random sparse matrices. Again, the performance of the 
Blaze and MTL4 libraries drop to the creation of a converted 
temporary and the performance of Eigen3 slightly increases. 
Consequently, the performance of Eigen3 can even surpass the 
Blaze performance for medium-sized matrices. For small and 
large sparse matrices Blaze exhibits the best performance. 

VI. Conclusion and Future Work 

We have conducted the first thorough performance analysis 
of several spMMM kernels on a modern standard processor. 
Employing a simple performance model we have demonstrated 
that our implementations can come close to the maximum 
predicted performance in the computational part of the kernel 
for out-of-cache situations with matrices leading to streaming 



memory access patterns. Due to further optimizations in the 
memory management and storage strategy, we can provide the 
currently fastest C++-based spMMM as part of the Blaze C++ 
library. Blaze combines high maintainability, which proves to 
be of essential importance for large scale software develop- 
ment, with HPC-grade performance that matches or exceeds 
the capabilities of other commonly used C++ math libraries. 

With the single core performance optimized the next step 
to improve the Blaze library is to include shared memory 
parallelization to exploit many- and multicore architectures. 
We expect that the typical contention and saturation effects 
seen with these architectures will add many new effects to 
the results presented here. Additionally, more work has to 
be invested in further improving the single core performance. 
Exploiting the given structure of the sparse matrix operands 
might be a possible approach. Alternative sorting algorithms 
which are better suited to sort short lists of unique integral 
numbers may also be advantageous. Finally, the decision 
criterion for which of the two storing strategies to use might 
be further improved. 
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